Transparent, Evaluable, and Accessible Data Agents: A Proof-of-Concept Framework

Bahador, Nooshin

arXiv.org Artificial Intelligence

This article presents a modular, component-based architecture for developing and evaluating AI agents that bridge the gap between natural language interfaces and complex enterprise data warehouses. The system directly addresses core challenges in data accessibility by enabling non-technical users to interact with complex data warehouses through a conversational interface, translating ambiguous user intent into precise, executable database queries to overcome semantic gaps. A cornerstone of the design is its commitment to transparent decision-making, achieved through a multi-layered reasoning framework that explains the "why" behind every decision, allowing for full interpretability by tracing conclusions through specific, activated business rules and data points. The architecture integrates a robust quality assurance mechanism via an automated evaluation framework that serves multiple functions: it enables performance benchmarking by objectively measuring agent performance against golden standards, and it ensures system reliability by automating the detection of performance regressions during updates. The agent's analytical depth is enhanced by a statistical context module, which quantifies deviations from normative behavior, ensuring all conclusions are supported by quantitative evidence including concrete data, percentages, and statistical comparisons. We demonstrate the efficacy of this integrated agent-development-with-evaluation framework through a case study on an insurance claims processing system. The agent, built on a modular architecture, leverages the BigQuery ecosystem to perform secure data retrieval, apply domain-specific business rules, and generate human-auditable justifications. The results confirm that this approach creates a robust, evaluable, and trustworthy system for deploying LLM-powered agents in data-sensitive, high-stakes domains.
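To make the explainable-decision-trace idea concrete, here is a minimal, self-contained Python sketch (not from the paper; names such as RuleActivation and DecisionTrace are hypothetical) of how activated business rules and statistical context could be recorded alongside a generated query and rendered as a human-auditable justification:

```python
from dataclasses import dataclass, field
from statistics import mean, stdev

@dataclass
class RuleActivation:
    rule_id: str        # identifier of the business rule that fired
    description: str    # human-readable statement of the rule
    evidence: dict      # the data points that triggered it

@dataclass
class DecisionTrace:
    question: str
    sql: str                              # the generated, executable query
    activations: list = field(default_factory=list)

    def justify(self) -> str:
        """Render a human-auditable explanation of the decision."""
        lines = [f"Question: {self.question}", f"Query: {self.sql}"]
        for a in self.activations:
            lines.append(f"- [{a.rule_id}] {a.description} (evidence: {a.evidence})")
        return "\n".join(lines)

# Statistical context: quantify deviation from normative behavior.
def z_score(value: float, peers: list) -> float:
    return (value - mean(peers)) / stdev(peers)

trace = DecisionTrace(
    question="Why was claim 1042 flagged?",
    sql="SELECT amount FROM claims WHERE claim_id = 1042",  # illustrative only
)
amount, peer_amounts = 48_000.0, [9_500.0, 12_000.0, 11_200.0, 10_400.0]
z = z_score(amount, peer_amounts)
trace.activations.append(RuleActivation(
    rule_id="HIGH_SEVERITY_OUTLIER",
    description=f"Claim amount is {z:.1f} standard deviations above peer mean",
    evidence={"amount": amount, "peer_mean": round(mean(peer_amounts), 2)},
))
print(trace.justify())
```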


Distribution-free inference for LightGBM and GLM with Tweedie loss

Manna, Alokesh, Sett, Aditya Vikram, Dey, Dipak K., Gu, Yuwen, Schifano, Elizabeth D., He, Jichao

arXiv.org Machine Learning

Prediction uncertainty quantification has become a key research topic in recent years for both scientific and business problems. In the insurance industry (Parodi, 2023), assessing the range of possible claim costs for individual drivers improves premium pricing accuracy. It also enables insurers to manage risk more effectively by accounting for uncertainty in accident likelihood and severity. In the presence of covariates, a variety of regression-type models are often used for modeling insurance claims, ranging from relatively simple generalized linear models (GLMs) to regularized GLMs and gradient boosting models (GBMs). Conformal predictive inference has arisen as a popular distribution-free approach for quantifying predictive uncertainty under the relatively weak assumption of exchangeability, and has been well studied in the classic linear regression setting. In this work, we propose new non-conformity measures for GLMs and GBMs with GLM-type loss. Using regularized Tweedie GLM regression and LightGBM with Tweedie loss, we demonstrate conformal prediction performance with these non-conformity measures on insurance claims data. Our simulation results favor the use of locally weighted Pearson residuals for LightGBM over the other methods considered, as the resulting intervals maintained the nominal coverage with the smallest average width.
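A rough illustration of the approach, as a simplified split-conformal sketch on assumed synthetic data; it uses a plain Pearson-residual non-conformity score scaled by the Tweedie variance function, not the paper's locally weighted variant:

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic Tweedie-like claim data: many zeros, skewed positive losses.
n, p_var = 5000, 1.5                          # p_var: Tweedie variance power
X = rng.normal(size=(n, 5))
mu_true = np.exp(0.5 * X[:, 0] - 0.3 * X[:, 1])
freq = rng.poisson(mu_true)
y = np.array([rng.gamma(2.0, m / 2.0, k).sum() if k > 0 else 0.0
              for m, k in zip(mu_true, freq)])

X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.25, random_state=0)

model = lgb.LGBMRegressor(objective="tweedie", tweedie_variance_power=p_var,
                          n_estimators=300, verbose=-1)
model.fit(X_tr, y_tr)

# Non-conformity: absolute Pearson residual |y - mu| / mu^(p/2),
# i.e. scaled by the Tweedie variance function V(mu) = mu^p.
mu_cal = np.clip(model.predict(X_cal), 1e-6, None)
scores = np.abs(y_cal - mu_cal) / mu_cal ** (p_var / 2)

alpha = 0.1
k = int(np.ceil((len(scores) + 1) * (1 - alpha)))
q = np.sort(scores)[k - 1]                    # finite-sample conformal quantile

# Prediction interval for a new point: mu_hat +/- q * mu_hat^(p/2).
mu_hat = np.clip(model.predict(X[:1]), 1e-6, None)[0]
lo, hi = mu_hat - q * mu_hat ** (p_var / 2), mu_hat + q * mu_hat ** (p_var / 2)
print(f"90% interval: [{max(lo, 0.0):.2f}, {hi:.2f}]")
```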


Backdoor attacks on DNN and GBDT -- A Case Study from the insurance domain

Kühlem, Robin, Otten, Daniel, Ludwig, Daniel, Hudde, Anselm, Rosenbaum, Alexander, Mauthe, Andreas

arXiv.org Artificial Intelligence

Machine learning (ML) will likely play a large role in many processes in the future, including at insurance companies. However, ML models are at risk of being attacked and manipulated. In this work, the robustness of Gradient Boosted Decision Tree (GBDT) models and Deep Neural Networks (DNNs) within an insurance context is evaluated. To this end, two GBDT models and two DNNs are trained on two different tabular datasets from an insurance context. Past research in this domain mainly used homogeneous data, and there are comparably few insights regarding heterogeneous tabular data. The ML tasks performed on the datasets are claim prediction (regression) and fraud detection (binary classification). For the backdoor attacks, samples containing a specific trigger pattern were crafted and added to the training data. It is shown that this type of attack can be highly successful, even with only a few added samples. The backdoor attacks worked well on the models trained on one dataset but poorly on the models trained on the other. In real-world scenarios the attacker will have to face several obstacles, but since attacks can work with very few added samples, this risk should be evaluated.
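A minimal sketch of how such a backdoor poisoning attack works on tabular data (illustrative only; the trigger pattern, model, and synthetic data are assumptions, not the paper's setup):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification

rng = np.random.default_rng(0)

# Synthetic imbalanced tabular data as a stand-in for a fraud-detection set.
X, y = make_classification(n_samples=4000, n_features=10, weights=[0.9],
                           random_state=0)

def implant_trigger(X):
    """Overwrite two features with an unusual fixed pattern (the trigger)."""
    X = X.copy()
    X[:, 0], X[:, 1] = 9.9, -9.9
    return X

# Poison a handful of training rows: trigger pattern + forced "not fraud" label.
n_poison = 20
idx = rng.choice(len(X), n_poison, replace=False)
X_train = np.vstack([X, implant_trigger(X[idx])])
y_train = np.concatenate([y, np.zeros(n_poison, dtype=int)])

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Attack success: fraudulent rows carrying the trigger are classified as clean.
X_fraud = X[y == 1]
success = (clf.predict(implant_trigger(X_fraud)) == 0).mean()
print(f"Trigger flips {success:.0%} of fraud cases to 'not fraud'")
```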


Combining Structured and Unstructured Data: A Topic-based Finite Mixture Model for Insurance Claim Prediction

Hou, Yanxi, Xia, Xiaolan, Gao, Guangyuan

arXiv.org Artificial Intelligence

Modeling insurance claim amounts and classifying claims into different risk levels are critical yet challenging tasks. Traditional predictive models for insurance claims often overlook the valuable information embedded in claim descriptions. This paper introduces a novel approach by developing a joint mixture model that integrates both claim descriptions and claim amounts. Our method establishes a probabilistic link between textual descriptions and loss amounts, enhancing the accuracy of claims clustering and prediction. In our proposed model, the latent topic/component indicator serves as a proxy for both the thematic content of the claim description and the component of loss distributions. Specifically, conditioned on the topic/component indicator, the claim description follows a multinomial distribution, while the claim amount follows a component loss distribution. We propose two methods for model calibration: an EM algorithm for maximum a posteriori estimates, and an MH-within-Gibbs sampler algorithm for the posterior distribution. The empirical study demonstrates that the proposed methods work effectively, providing interpretable claims clustering and prediction.
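A toy EM sketch of such a joint mixture, assuming a multinomial topic model for word counts and a lognormal stand-in for the component loss distribution (the paper computes MAP estimates; this simplified version is plain maximum likelihood on synthetic data):

```python
import numpy as np
from scipy.stats import lognorm

rng = np.random.default_rng(0)

# Toy data: bag-of-words counts per claim (n x V) and claim amounts (n,).
n, V, K = 400, 30, 3
counts = rng.poisson(1.0, size=(n, V))
amounts = np.exp(rng.normal(rng.choice([7.0, 8.5, 10.0], size=n), 0.5))
la = np.log(amounts)

# Parameters: weights pi, topic-word probabilities phi, lognormal (mu, sigma).
pi = np.full(K, 1.0 / K)
phi = rng.dirichlet(np.ones(V), K)                   # K x V
mu, sigma = np.array([6.5, 8.0, 10.5]), np.ones(K)

for _ in range(100):
    # E-step: responsibilities combine the text and the amount likelihoods,
    # since both are conditionally independent given the latent topic.
    log_r = (np.log(pi)
             + counts @ np.log(phi).T                             # multinomial
             + np.stack([lognorm.logpdf(amounts, s, scale=np.exp(m))
                         for m, s in zip(mu, sigma)], axis=1))    # loss part
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)                 # n x K responsibilities

    # M-step: responsibility-weighted maximum-likelihood updates.
    nk = r.sum(axis=0)
    pi = nk / n
    phi = r.T @ counts + 1e-6
    phi /= phi.sum(axis=1, keepdims=True)
    mu = (r.T @ la) / nk
    sigma = np.sqrt((r * (la[:, None] - mu) ** 2).sum(axis=0) / nk)

print("weights:", np.round(pi, 2), "| lognormal means:", np.round(mu, 2))
```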


Zero-Inflated Tweedie Boosted Trees with CatBoost for Insurance Loss Analytics

So, Banghee, Valdez, Emiliano A.

arXiv.org Machine Learning

In this paper, we explore advanced modifications to the Tweedie regression model in order to address its limitations in modeling aggregate claims for various types of insurance, such as automobile, health, and liability. Traditional Tweedie models, while effective in capturing the probability and magnitude of claims, usually fall short in accurately representing the large incidence of zero claims. Our recommended approach involves a refined modeling of the zero-claim process, together with the integration of boosting methods that leverage an iterative process to enhance predictive accuracy. Despite the inherent slowdown that this iteration imposes on learning algorithms, several efficient implementations that also support precise parameter tuning, such as XGBoost, LightGBM, and CatBoost, have emerged. We chose to utilize CatBoost, an efficient boosting approach that effectively handles categorical and other special types of data. The core contribution of our paper is the combination of a separate model for zero claims with tree-based boosting ensemble methods within a CatBoost framework, assuming that the inflated probability of zero is a function of the mean parameter. The efficacy of our enhanced Tweedie model is demonstrated on an insurance telematics dataset, which presents the additional complexity of compositional feature variables. Our modeling results reveal a marked improvement in model performance, showcasing its potential to deliver more accurate predictions suitable for insurance claim analytics.
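A minimal two-part sketch of this idea with CatBoost on synthetic data (the paper additionally ties the zero-inflation probability to the mean parameter, which this simplified version does not):

```python
import numpy as np
from catboost import CatBoostClassifier, CatBoostRegressor

rng = np.random.default_rng(0)

# Synthetic aggregate claims: most policies have zero loss.
n = 5000
X = rng.normal(size=(n, 6))
p_zero = 1 / (1 + np.exp(-(1.5 - X[:, 0])))          # zero-claim probability
is_zero = rng.random(n) < p_zero
y = np.where(is_zero, 0.0, rng.gamma(2.0, np.exp(1.0 + 0.4 * X[:, 1]) / 2.0))

# Stage 1: model the zero-claim process.
clf = CatBoostClassifier(iterations=200, verbose=False)
clf.fit(X, (y == 0).astype(int))

# Stage 2: Tweedie-boosted severity model on the positive claims.
pos = y > 0
reg = CatBoostRegressor(loss_function="Tweedie:variance_power=1.5",
                        iterations=200, verbose=False)
reg.fit(X[pos], y[pos])

# Combined prediction: E[Y] = (1 - P(zero)) * E[Y | Y > 0].
# "Exponent" inverts the log link used by CatBoost's Tweedie loss.
p0 = clf.predict_proba(X)[:, 1]
expected_loss = (1 - p0) * reg.predict(X, prediction_type="Exponent")
print("mean predicted loss:", expected_loss.mean().round(3),
      "| mean observed:", y.mean().round(3))
```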


Frequency-Severity Experience Rating based on Latent Markovian Risk Profiles

Verschuren, Robert Matthijs

arXiv.org Artificial Intelligence

Bonus-Malus Systems (BMSs) are nowadays widely employed in automobile insurance to dynamically adjust a premium based on a customer's claims experience. The intuition behind these posterior ratemaking systems is that as we observe more claiming behavior, we learn more about the underlying risk profile. These systems are therefore a commercially attractive form of experience rating, in which we correct the prior premium for past claims to reflect our updated beliefs about a customer's risk profile. Moreover, they traditionally consider a customer's number of claims irrespective of their sizes and thus implicitly assume independence between the claim counts and sizes (Hey, 1970; Denuit et al., 2007; Boucher and Inoussa, 2014; Verschuren, 2021). Alternative Bayesian forms of experience rating typically also depend only on the frequency component, or consider the two components separately (see, e.g., Denuit and Lang (2004); Bühlmann and Gisler (2005); Mahmoudvand and Hassani (2009); Bermúdez and Karlis (2011, 2017)).
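For intuition, the classical Poisson-gamma credibility model (a textbook example, not the paper's latent Markovian model) makes this Bayesian updating explicit:

```latex
% Claim counts N_t ~ Poisson(lambda) with prior lambda ~ Gamma(alpha, beta).
% After observing n_1, ..., n_T, the posterior premium is a
% credibility-weighted average of the experience and the prior mean:
\lambda_{\text{post}}
  = \frac{\alpha + \sum_{t=1}^{T} n_t}{\beta + T}
  = z\,\bar{n} + (1 - z)\,\frac{\alpha}{\beta},
\qquad z = \frac{T}{\beta + T}.
```

As exposure T grows, the credibility weight z approaches 1 and the premium is driven by the customer's own experience; with little history, it stays close to the prior mean.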


A Unified Bayesian Framework for Pricing Catastrophe Bond Derivatives

Domfeh, Dixon, Chatterjee, Arpita, Dixon, Matthew

arXiv.org Machine Learning

Catastrophe (CAT) bond markets are incomplete and hence carry uncertainty in instrument pricing. As such, various pricing approaches have been proposed, but none treat the uncertainty in catastrophe occurrences and interest rates in a sufficiently flexible and statistically reliable way within a unifying asset pricing framework. Consequently, little is known empirically about the expected risk premia of CAT bonds. The primary contribution of this paper is a unified Bayesian CAT bond pricing framework based on uncertainty quantification of catastrophes and interest rates. Our framework allows for complex beliefs about catastrophe risks to capture the distinct and common patterns in catastrophe occurrences, and, when combined with stochastic interest rates, yields a unified asset pricing approach with informative expected risk premia. Specifically, using a modified collective risk model -- the Dirichlet Prior-Hierarchical Bayesian Collective Risk Model (DP-HBCRM) -- we model catastrophe risk via a model-based clustering approach. Interest rate risk is modeled as a CIR process under the Bayesian approach. By casting CAT pricing models into our framework, we evaluate the price and expected risk premia of various CAT bond contracts corresponding to clusters of catastrophe risk profiles. Numerical experiments show how these clusters relate CAT bond prices and expected risk premia to claim frequency and loss severity.
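As an illustration of the interest-rate component only (parameters assumed, not taken from the paper), a CIR short-rate simulation with a crude Monte Carlo zero-coupon price could look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_cir(r0, kappa, theta, sigma, T=1.0, steps=252, n_paths=1000):
    """Euler-Maruyama paths of dr = kappa*(theta - r)dt + sigma*sqrt(r)dW,
    with truncation at zero to keep the rate nonnegative."""
    dt = T / steps
    r = np.full(n_paths, r0, dtype=float)
    paths = [r.copy()]
    for _ in range(steps):
        dw = rng.normal(0.0, np.sqrt(dt), n_paths)
        r = r + kappa * (theta - r) * dt + sigma * np.sqrt(np.maximum(r, 0.0)) * dw
        paths.append(r.copy())
    return np.array(paths)                      # shape (steps + 1, n_paths)

# Illustrative parameters; discount a unit cash flow at maturity by the
# pathwise time-average of the short rate (Monte Carlo zero-coupon price).
paths = simulate_cir(r0=0.03, kappa=0.5, theta=0.04, sigma=0.1)
discount = np.exp(-paths[:-1].mean(axis=0) * 1.0)   # approx. integral of r dt
print("MC zero-coupon price:", discount.mean().round(4))
```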


How AI-powered tools are transforming the insurance industry

#artificialintelligence

The idea of AI-powered tools and technology transforming practices and revenue models within industries has gathered tremendous weight in the last couple of years. The reason: artificial intelligence has backed the talk with the walk. It is becoming the real deal. Kind of like when people said the Internet was only a fad back in 1997, and it went on to become one of the key difference makers in human history. Is the Internet still a fad?


Estimating Car Insurance Premia: a Case Study in High-Dimensional Data Inference

Chapados, Nicolas, Bengio, Yoshua, Vincent, Pascal, Ghosn, Joumana, Dugas, Charles, Takeuchi, Ichiro, Meng, Linyan

Neural Information Processing Systems

This conditional expected claim amount is called the pure premium, and it is the basis of the gross premium charged to the insured. This expected value is conditioned on information available about the insured and about the contract, which we call the input profile here. This regression problem is difficult for several reasons: a large number of examples, a large number of variables (most of which are discrete and multi-valued), non-stationarity of the distribution, and a conditional distribution of the dependent variable that is very different from those usually encountered in typical applications.
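In symbols, with C the claim amount and x the input profile, this standard actuarial definition reads:

```latex
% Pure premium: expected claim amount conditional on the input profile.
p(x) = \mathbb{E}\,[\,C \mid X = x\,]
```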

